 text output


Revealing economic facts: LLMs know more than they say

arXiv.org Artificial Intelligence

During training, generative large language models (LLMs) are exposed to vast amounts of information, including data relevant to economic modelling, such as geospatial statistics and firm-level financial metrics. If LLMs can effectively retrieve and utilise this knowledge, they could reduce dependence on external data sources that are time-consuming to access, clean, and merge, or that incur financial costs. Moreover, if LLMs accurately represent data, they could support downstream tasks like data imputation and outlier detection. In this study, we evaluate whether and how LLMs can be used for typical economic data processes. Not all knowledge within an LLM may be explicit and retrievable in natural language by prompting the model.


A Taxonomy of Linguistic Expressions That Contribute To Anthropomorphism of Language Technologies

arXiv.org Artificial Intelligence

Recent attention to anthropomorphism -- the attribution of human-like qualities to non-human objects or entities -- of language technologies like LLMs has sparked renewed discussions about potential negative impacts of anthropomorphism. To productively discuss the impacts of this anthropomorphism and in what contexts it is appropriate, we need a shared vocabulary for the vast variety of ways that language can be anthropomorphic. In this work, we draw on existing literature and analyze empirical cases of user interactions with language technologies to develop a taxonomy of textual expressions that can contribute to anthropomorphism. We highlight challenges and tensions involved in understanding linguistic anthropomorphism, such as how all language is fundamentally human and how efforts to characterize and shift perceptions of humanness in machines can also dehumanize certain humans. We discuss ways that our taxonomy supports more precise and effective discussions of and decisions about anthropomorphism of language technologies.


GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

arXiv.org Artificial Intelligence

We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low bitrate (175bps), single-codebook speech tokenizer with 12.5Hz frame rate derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from text to speech modalities, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. We continue pre-training from the pre-trained text language model GLM-4-9B with a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens, achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality.
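
The tokenizer figures quoted above can be cross-checked with a little arithmetic. The sketch below assumes one token is emitted per frame and that the bitrate equals the frame rate times the bits per token; the function name and the implied codebook size are illustrative and are not stated in the abstract.

```python
def bits_per_frame(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Bits carried by each token, assuming exactly one token per frame."""
    return bitrate_bps / frame_rate_hz


# Figures taken from the abstract: 175 bps at a 12.5 Hz frame rate.
bits = bits_per_frame(175, 12.5)

# Assumption: a single codebook addressed by a fixed-width index,
# so its size would be 2 ** bits.
implied_codebook_size = 2 ** int(bits)

print(f"{bits} bits per token, implied codebook size {implied_codebook_size}")
```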


"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

arXiv.org Artificial Intelligence

The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions (MCQ) to limit the response space. The model is then evaluated by ranking the candidate answers by the log probability of the first token prediction. However, first tokens may not consistently reflect the final response output, due to the model's diverse response styles such as starting with "Sure" or refusing to answer. Consequently, MCQ evaluation is not indicative of model behaviour when interacting with users. But by how much? We evaluate how aligned first-token evaluation is with the text output along several dimensions, namely final option choice, refusal rate, choice distribution and robustness under prompt perturbation. Our results show that the two approaches are severely misaligned on all dimensions, reaching mismatch rates over 60%. Models heavily fine-tuned on conversational or safety data are especially impacted. Crucially, models remain misaligned even when we increasingly constrain prompts, i.e., force them to start with an option letter or example template. Our findings i) underscore the importance of inspecting the text output as well and ii) caution against relying solely on first-token evaluation.
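
To make the comparison concrete, here is a minimal sketch of how a mismatch rate between first-token ranking and the generated text could be computed. The helper names, the regular-expression answer parser, and the toy log probabilities are all assumptions for illustration; they are not the paper's evaluation code, and a real parser would need to handle far more response styles.

```python
import re


def first_token_choice(option_logprobs):
    """Return the option letter whose first-token log probability is highest."""
    return max(option_logprobs, key=option_logprobs.get)


def text_choice(response, options="ABCD"):
    """Return the first standalone option letter in the text, or None (e.g. a refusal)."""
    match = re.search(rf"\b([{options}])\b", response)
    return match.group(1) if match else None


def mismatch_rate(examples):
    """Fraction of examples where the two readings of the model's answer disagree."""
    mismatches = sum(
        first_token_choice(logprobs) != text_choice(text)
        for logprobs, text in examples
    )
    return mismatches / len(examples)


# Toy data: (first-token log probabilities, generated text). Values are made up.
examples = [
    ({"A": -0.2, "B": -2.0, "C": -3.0, "D": -4.0}, "A"),
    ({"A": -0.5, "B": -1.5, "C": -1.2, "D": -3.0}, "Sure! My answer is C."),
    ({"A": -1.0, "B": -0.3, "C": -2.0, "D": -2.5}, "I'm sorry, but I can't help with that."),
    ({"A": -3.0, "B": -2.0, "C": -2.5, "D": -0.1}, "D. The capital is Paris."),
]

print(mismatch_rate(examples))
```

In this toy set, the second example disagrees because the text opens with "Sure", and the third because the text is a refusal with no option letter at all.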


Phase Transitions in the Output Distribution of Large Language Models

arXiv.org Artificial Intelligence

In a physical system, changing parameters such as temperature can induce a phase transition: an abrupt change from one state of matter to another. Analogous phenomena have recently been observed in large language models. Typically, the task of identifying phase transitions requires human analysis and some prior understanding of the system to narrow down which low-dimensional properties to monitor and analyze. Statistical methods for the automated detection of phase transitions from data have recently been proposed within the physics community. These methods are largely system agnostic and, as shown here, can be adapted to study the behavior of large language models. In particular, we quantify distributional changes in the generated output via statistical distances, which can be efficiently estimated with access to the probability distribution over next-tokens. This versatile approach is capable of discovering new phases of behavior and unexplored transitions -- an ability that is particularly exciting in light of the rapid development of language models and their emergent capabilities.
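
As a rough illustration of the idea, the sketch below measures how much a next-token distribution changes between neighbouring values of a control parameter. It assumes toy logits, temperature as the parameter, and two standard distances (total variation and Jensen-Shannon divergence); the paper's actual estimators and choice of distance may differ.

```python
import math


def softmax(logits, temperature):
    """Next-token probabilities from logits at a given sampling temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in nats (symmetric, bounded by log 2)."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


# Hypothetical logits over a four-token vocabulary, scanned over temperature.
logits = [4.0, 2.0, 1.0, 0.5]
temperatures = [0.25, 0.5, 1.0, 2.0, 4.0]
dists = [softmax(logits, t) for t in temperatures]

# A large distance between neighbouring parameter values flags a candidate transition.
for t0, t1, p, q in zip(temperatures, temperatures[1:], dists, dists[1:]):
    print(f"T={t0} -> T={t1}: TV={total_variation(p, q):.3f}, JS={jensen_shannon(p, q):.3f}")
```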


From Text to Pixel: Advancing Long-Context Understanding in MLLMs

arXiv.org Artificial Intelligence

The rapid progress in Multimodal Large Language Models (MLLMs) has significantly advanced their ability to process and understand complex visual and textual information. However, the integration of multiple images and extensive textual contexts remains a challenge due to the inherent limitation of the models' capacity to handle long input sequences efficiently. In this paper, we introduce SEEKER, a multimodal large language model designed to tackle this issue. SEEKER aims to optimize the compact encoding of long text by compressing the text sequence into the visual pixel space via images, enabling the model to handle long text within a fixed token-length budget efficiently. Our empirical experiments on six long-context multimodal tasks demonstrate that SEEKER can leverage fewer image tokens to convey the same amount of textual information compared with the OCR-based approach, and is more efficient in understanding long-form multimodal input and generating long-form textual output, outperforming all existing proprietary and open-source MLLMs by large margins.


Fusing ASR Outputs in Joint Training for Speech Emotion Recognition

arXiv.org Artificial Intelligence

Alongside acoustic information, linguistic features based on speech transcripts have been proven useful in Speech Emotion Recognition (SER). However, due to the scarcity of emotion labelled data and the difficulty of recognizing emotional speech, it is hard to obtain reliable linguistic features and models in this research area. In this paper, we propose to fuse Automatic Speech Recognition (ASR) outputs into the pipeline for joint training SER. The relationship between ASR and SER is understudied, and it is unclear what and how ASR features benefit SER. By examining various ASR outputs and fusion methods, our experiments show that in ...

SER models built on such limited sized corpora don't generalize well to out-of-domain speech. Second, while previous studies proposed using ASR to generate transcripts for SER [5], ASR on emotional speech can often result in relatively high error rates. Previous research has shown that emotion in speech degrades ASR performance, with emotional speech assumed to be a distortion of neutral speech [6]. However, with the advancement of deep learning technologies, transfer learning for SER from ASR and joint training of ASR and SER have recently emerged [7, 8]. Nevertheless, the relationship between ASR and SER is still poorly studied, particularly what and how ASR features can benefit SER.


Getting AI to Take my Notes for Me

#artificialintelligence

In our daily lives, we are constantly being bombarded by data, but here's the catch. Not all of it is digital. On average, we hear 30,000 words and speak at least 7,000 daily. As we all know, not every single word is important, and to track what is, we try taking notes. However, we suck at knowing what's going to come in handy, and this results in us taking down too many notes.


Evaluating Text Output in NLP: BLEU at your own risk

#artificialintelligence

One question I get fairly often from folks who are just getting into NLP is how to evaluate systems when the output of that system is text, rather than some sort of classification of the input text. These types of problems, where you put some text into your model and get some other text out of it, are known as sequence to sequence or string transduction problems. This sort of technology is right out of science fiction. With such a wide range of exciting applications, it's easy to see why sequence to sequence modeling is more popular than ever. What's not easy is actually evaluating these systems. Unfortunately for folks who are just getting started, there's no simple answer about what metric you should use to evaluate your model. Even worse, one of the most popular metrics for evaluating sequence to sequence tasks, BLEU, has major drawbacks, especially when applied to tasks that it was never intended to evaluate.
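
For readers who have not seen how BLEU is computed, here is a minimal single-reference, sentence-level sketch: clipped n-gram precisions combined by a geometric mean, times a brevity penalty. It is a simplified illustration with no smoothing and a hypothetical function layout, not a replacement for an established implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as often as in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


reference = "the cat sat on the mat".split()
paraphrase = "a cat was sitting on a mat".split()

print(bleu(reference, reference))   # identical output
print(bleu(paraphrase, reference))  # reasonable paraphrase, almost no n-gram overlap
```

The paraphrase shares no bigram with the reference, so its unsmoothed score collapses to zero even though a human would likely accept it, which is one concrete form of the drawbacks mentioned above.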